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Abstract 

We adapt a well known streaming algorithm for approximating item frequencies to the matrix sketch- 
ing setting. The algorithm receives the rows of a large matrix A 6 R" xm one after the other in a 
streaming fashion. It maintains a sketch matrix B £ K 1 / EXm such that for any unit vector x 

\\Ax\\ 2 > \\Bxf > \\Axf - e\\Af f . 

i— l Sketch updates per row in A require 0(m/e 2 ) operations in the worst case. A slight modification of the 

algorithm allows for an amortized update time of 0(m/e) operations per row. The presented algorithm 
stands out in that it is: deterministic, simple to implement, and elementary to prove. It also experi- 
mentally produces more accurate sketches than widely used approaches while still being computationally 
competitive. 

1 Introduction 

Efficiently obtaining compact approximations, or sketches, of large matrices is a very common and important 

1 ^ i problem. It is essential for obtaining approximate matrix Singular Value Decompositions (SVD) or low rank 

approximations of large matrices. It is used in large scale data mining, Latent Semantic Indexing (LSI), 
k- means clustering, Principal Component Analysis (PCA), Linear regression, solving large linear systems, 
matrix preconditioning, and many other important and commonly performed tasks. 

Moreover, modern large data sets are often viewed as large matrices. For example, textual data in the 
bag-of-words model is stored such that each row in the data matrix corresponds to one document and non 
zero entries correspond to words in the documents. Such matrices are often extremely large and distributed 
across many machines. Sketching methods, therefore, are designed to be pass efficient which means the data 
is read at most a constant number of times. If only one pass is required the computational model is also 
referred to as the streaming model. The streaming model is especially attractive since the analysis can be 
performed immediately when the data is collected. In that case its storage can be eliminated altogether. 

There are three main matrix sketching approaches which are presented here in an arbitrary order. The 
first generates a sparser version of the matrix. Sparser matrices are stored more efficiently and can be mul- 
tiplied by other matrices faster pQ [5] . The second approach is to randomly combine matrix rows [3] @] [5] [S] . 
These methods rely on properties of random low dimensional subspaces and strong concentration of measure 
phenomena. The third is to find a small subset of matrix rows (or columns) which approximate the entire 
matrix. This problem is known as the Column Subset Selection Problem and has been thoroughly investi- 
gated [7][§][2] [TU] [TT] JT3] . Recent results obtain algorithms with almost matching lower bounds [TU] [T3] [T3] ■ 
Alas, It is not immediately clear how to compare these methods' results to ours since their objectives are 
different. They aim to recover a low rank matrix whose column space contains most of space spanned by 
the matrix top k singular vectors. Moreover, most of the above algorithms require several passes over the 
input matrix. A simple streaming solution to the Column Subset Selection problem is obtained by sampling 
columns (rows in this case) from the input matrix. The rows are sampled with probability proportional to 
their squared norm. Despite this algorithm's apparent simplicity, providing tight bounds for its performance 
required over a decade of research [7] [14] [ 15 ] [IB] [17 ] [18] [TT] . 
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This manuscript proposes a forth approach which draws on the matrix sketching problem's similarity to 
the item frequency estimation problem. In what follows, we shortly describe the item frequency approxima- 
tion problem and a well known solution for it which our proposed algorithm will be based on. 

In the item frequency approximation problem there is a universe of m items a\ , . . . , a m and a stream 
A\, . . . , A n of item appearances. The frequency of an item is the number of times it appears in the stream. 
It is trivial to produce these counts using 0(m) space simply by keeping a counter for each item. Our goal is 
to use o(m) space and produce approximate frequencies gj such that \fj — gj\ < en for all j and prescribed 
precision e. 

This problem received an incredibly simple and beautiful solution by |19) . This solution was later and 
independently rediscovered by |20| and [5T] which also improved its update complexity. The algorithm 
simulates the process of repeatedly 'deleting' form the stream £ = 1/e appearances of different items until 
this is no longer possible. In other words, until there are less than £ unique items left. This trimmed stream 
is stored concisely in 0{€) space. The claim is that if item a,j appears in the final trimmed stream gj times 
than gj is a good approximation for its true frequency fj (even if gj = 0) . This is because fj — gj < t where 
t is the number of times items were deleted (item of type Oj is deleted at most once in each deletion batch) . 
Moreover, we delete £ items in every batch and at most n items can be deleted altogether. Thus, tl < n or 
t < en which completes the proof. The reader is referred to |21) for an efficient streaming implementation. 
From this point on we refer to this algorithm as Frequent-Items. 

Let us now describe the item frequency problem as a matrix sketching problem. Let A be a stream of 
indicator vectors A\ € R rn instead of discrete elements. Each row in A is Ai — ej (the j'th standard basis 
vector) if Ai = dj. The frequency fj can be expressed as fj — ||Aej|| 2 . A good approximation matrix B 
would be one such that gj = \\Bej\\ 2 is a good approximation to fj. Replacing n = \\A\\ 2 we get that the 
condition \fj — gj\ < en is equivalent to |||Aej|| 2 — ||Bej|| 2 | < e||A||i. From the above, it is clear that for 
'item indicator' matrices a sketch B € M 1 / £Xm can be obtained by the Frequent- Items algorithm. 

In this paper we suggest Frequent-Directions which is an extension of Frequent-Items to general matrices. 
Given any matrix A € M raxm a sketch B g ]R 1 / £Xm is produced such that: 

MxeW 1 - 1 \\\Ax\\ 2 - \\Bx\\ 2 \ < e\\A\\) or alternatively || A T A - B T B\\ < e ■ tr{A T A) 

The intuition behind Frequent-Directions is surpassingly similar to the one above. In the same way that 
Frequent-Items periodically deletes £ different element, Frequent-Directions periodically removes from its 
sketch £ orthogonal vectors. This means that the Frobenius norm of the trimmed sketch matrix reduces 
by a factor £ faster than its projection on any single direction. Since the final sketch's Frobenius norm is 
non negative, we are guarantied that no direction in space is reduced by 'too much'. This intuition exact 
below. As a remark, when presented with and 'item indicator' matrix Frequent-Directions exactly mimics 
Frequent-Items. 

2 The algorithm 

Claim 1. Let B be the result of applying Algorithm^ to matrix A with prescribed precision parameter e 
then: 

Va; e R d \\Ax\\ 2 > \\Bx\\ 2 > \\Ax\\ 2 - e\\ A\\ }\\x\\ 2 

Or alternatively: 

A T A>B T B and \\A T A — B T B\\ < e ■ tr{A T A) 

Proof. We begin by obtaining the value of X)"=i $i by computing the Frobenius norm of B. Let B l , C l and 
V 1 be the values of B, C and V after the main loop in the algorithm is executed i times. For example, B° 
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Algorithm 1 Frequent-Directions 


Input: £ e (0,1], A £ R nxm 




£ «- |-l/el 




B <— all zeros matrix g ]R £xm 




for i e [n] do 






# denotes the £'th row of £> 


[[/, E, V] <- SVD(B) 


# UEV = B, C/ T C/ = VV T = h 




# E = diag([cri, . . . ,ae\), <n> ■■■>(?£ 


c <- 


# Only for proof notation 


5i «- E 2 ^ 


# E^ the least singular value of B 


E^ ^/max(E 2 -7^,0) 


# le is an £ x £ identity matrix 




# Here Bt contains zeros since E^ = 


Return: S 



is an all zeros matrix and B n is the returned sketch matrix. 

n 

i=l 

n 

= ^[(IIClJ-llB^HP-dlCIJ-llBlJ.)] 

n 

= ^\\A4 2 -tr{C iT &-E^ T B i ) 

n n 

= P||2-^tr(^V) = Pf / -^5 i 

i=l i=l 

The reason that ||C*|| 2 — ||.B* _1 || 2 = \\Ai\\ 2 is because C l is, up to a unitary left rotation, a matrix which 
contains both B % ~ x and Ai. Remember that the last row of B 1 ^ 1 , which Ai replaced, contains only zero 
values. We gain that Ym=i $i = (!l^!l/ — l|B|| 2 )/^ which we use shortly. Now, let us compute the value of 
||Ae|| 2 - \\Bx\\ 2 for a vector x e R d . 

n 

\\A x f-\\Bx\\ 2 = ^[(^^> 2 + ll^- 1 x|| 2 -||^|| 2 ] 

n 

= £[||C^|| 2 - HB'sf ] 

2=1 

n n 

< ^2\\C lT C l - B^ B% ■ \\x\\ 2 = \\x\\ 2 ^Si 

i=l i=l 

The first transition is due to the fact that (Ai,x) 2 + ||B* _1 a;j| 2 = ||C"a;|| 2 . The second transition is correct 
because \\C tT C l - B iT B l \\ = ||<5 < V ,iT y i || = fc. Substituting that £?=i S { = (|| A\\ 2 - \\B\\ 2 )/£, \\B\\ 2 > and 
l/£ < e completes the claim. □ 



2.1 Running time 

Let Tsvd (£, to) stand for the number of operations required to obtain the Singular Value Decomposition of 
an £ by m matrix. The worst case update time of Frequent-Directionsis 0(Tg V Y)(£, m)) which is also 0{m/e 2 ) 
operations per incoming vector. This is because the execution of the main loop is dominated by computating 
the sketch SVD. 
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However, there is no reason to compute the SVD of B in each and every iteration. In Algorithm [T] 
consider replacing the statement Si <— Hj £ with Si <— E 2 ^ rf for some c G [1/10,9/10] and assume for 
simplicity that c£ is integer. In every computation of the SVD the algorithm nullifies (1 — c)£ rows of the 
sketch. Therefore, the next (1 — c)£ rows it receives can be places in zero valued rows. Consequently, the 
SVD of the sketch is computed only once every (1 — c)£ input rows. This gives a total running time of 
0(n/£ ■ Tsw(£,m)) which is amortized 0(mje) per row. The proof above carries over almost without a 
change. The resulting approximation is slightly weakened though. The modified algorithm only guaranties 
that \\A T A - B T B\\ < tr{A T A)/c£. 

Remark 1. From a theoretical stand point 0(TsyY>(£,m)) can be reduced using fast matrix multiplication. 
Let a denote the smallest scalar such that multiplying two £ x £ matrices reguires 0{£ 2+a ) operations. To 
compute the SVD of B we first use fast matrix multiplication to compute BB T in time 0{m£ 1+a ). Then we 
compute [U, S 2 ,U T ] = SVD(BB T ) which requires 0(£ 3 ) operations. Finally, we compute the right singular 
vectors of B by V — S~ 1 U T B which, using fast matrix multiplication, requires 0(m£ 1+a ). This reduces the 
theoretical computation time of the SVD to 0(m£ 1+a + £ 3 ). Note that a < 0.5 ]2I^ . Although computing 
the SVD in this manner is numerically unstable and generally recommended against, in this case it might 
be beneficial. Due to the algorithm's relatively weak approximation guaranty the accuracy loss incurred by 
squaring the matrix condition number might not be meaningful. Moreover, there is no need for the SVD 
procedure to converge to machine precision. 



2.2 Parallelization and sketching sketches 

A convenient property of this sketching technique is that it allows for combining sketches. In other words, 
let A\ and A 2 denote two halves of a larger matrix A. Also, let B\ and B 2 be the sketches computed by the 
above technique for A\ and A2 respectively. Now let the final sketch, C, be the sketch of a matrix B which 
contains both B 1 and B 2 . It still holds that \\A T A - C T C\\ < e ■ tr(A T A - C T C). To see this we compute 
||Cx|| 2 for a test vector ||a;|| = 1. 

\\Cxf > \\Bx\\ 2 -s(\\B\\ 2 f-\\C\\ 2 f ) 

= \\B lX \\ 2 + \\B 2 x\\ 2 - e{\\B x f f + \\B 2 \\ 2 ) + e\\C\\ 2 

I/" 11*11/) 

1/ - ll B 2||/) 

I/ + \\B 2 f f ) 
\ 2 + \\A 2 x\\ 2 -e(\\A " J 1 ' J ■ '-"' J 

2 _/|| a 1 1 2 ||/~<||2n 



> WA.xf-eiWA^-WB.W 2 ) 

+ \\A 2 x\\ 2 -e(\\A 2 \\ 2 -\\B 2 \\ 2 ) 

-eiWB^j + WB^ + eWCW 2 
= \\A lX \\ 2 + \\A 2 x\\ 2 - eiWA^l 2 + \\A 2 \\}) +e\\C\\ f 



= \\Ax\\ 2 -e(\\A\\j-\\C\\j) 

Here we use the fact that ||i?ia;|| 2 > ||v4ix|| 2 — e(||Ai|j 2 — ||Bi|| 2 ) for ||a;|| = 1 which is a consequence of the 
derivation above. This property is especially useful when the matrix (or data) is distributed across many 
machines which is often the case in modern large scale data. 



2.3 Connection to matrix low rank approximation 

Low rank approximation of matrices is a well studied problem. The goal is to obtain a small matrix B 
containing £ columns which contains in its columns space a matrix n of rank k such that ||j4 — AII||£ < 
(1 + e)\\A — Here, Ak is the best rank k approximation of A and £ is either 2 (spectral norm) or / 

(Frobenius norm). It is difficult to compare our algorithm to this line of work since the types of bounds 
sought are qualitatively different. We remark, however, that it is possible to use Frequent-Directions to 
produce a low rank approximation result. 

Lamma 4 from [15j (modified). Let denote the projection matrix on the left k singular vectors of B 
corresponding to its largest singular values. Then the following holds \\A — AP^\\ 2 < cr\ +1 + 2\\A T A — B T B\\ 
where ak+i is the (fc + l)'th singular value of A. 
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Therefore, if 2\\A T A - B T B\\ < e<rl +1 we have that \\A - APj?\\ < a k+1 (l + e) which is a 1 + e 
approximation to the optimal solution. Letting the sketch B maintain I > 2\\A\\ 2 / ea\ +1 ensures that this 
is the case. Since ||A||^/ct| +1 € O(fc) this is asymptotically inferior to the space requirement of [T2]. That 
said, if ||^4||//o"fc_|_i € 0(k) this is also optimal due to [13] . 



3 Experiments 

We compere Frequent-Directions to five different techniques. The first two constitute a brute force and a naive 
baseline. The other three are common algorithms which are commonly used in practice. Namely: sampling, 
hashing, and random projection. These produce sketch matrices B £ Jj« xm such that E[B T B] = A T A. The 
tested methods are limited in storage to axi£xm sketch matrix B and additional auxiliary variables in o(£m) 
space. This is with the exception of the brute force algorithm. For a given input matrix A we compare the 
methods' computational efficiency and resulting sketch accuracy. The computational efficiency is taken as 
the time required to produce B from the stream of A's rows. The accuracy of a sketch matrix B is measured 
by \\A T A-B T B\\. 

Brute Force: the brute force approach produces the optimal rank £ approximation of A. It explicitly 
computes the matrix A T A = AfAi by aggregating the outer products of the rows of A. The final 
'sketch' consists of the top I right singular vectors and values (square rooted) of A T A which are obtained by 
computing its SVD. The update time per row in A is 0(m 2 ) and space requirement is 0(m 2 ). 
Nai've: upon receiving a row in A the naive method docs nothing. The sketch it returns is an all zeros £ by 
m matrix. This baseline is important for two reasons. First, it can actually be more accurate than random 
methods due to under sampling scaling issues. Second, although it does not perform any computation is 
does incur computation overheads and I/O exactly like the other methods. It is therefore an important 
benchmark in both accuracy and running time measurements. 

Sampling: each row in the sketch matrix B samp is chosen i.i.d. from Ai and rescaled. More accurately, each 
row Bj amp takes the value ^ jj^jj Ai with probability pi — ||Aj|| 2 /|| The space it requires is 0(m£) 

in the worst case but it can be much lower if the chosen rows are sparse. Here this is implemented as I 
independent reservoir samplers, each sampling one row according to the distribution. The update running 
time is therefore, 0(£ + m) per row in A. 

Hashing: The matrix B hash is generated by adding or subtracting the rows of A from random rows of 
B hash . More accurately, B hash is initialized to be an I by m all zeros matrix. Then, when processing Ai 
we perform B^^ h <— B^^ h + s(z)Aj. Here h : [n] [£] and s : [n] —> { — 1, 1} are perfect hash functions. 
There is no harm in assuming such functions exist since complete randomness is naively possible without 
dominating either space or running time. This method is often used in practice by the machine learning 
community and is referred to as 'feature hashing' or 'hashing trick' (23| . 

Random Projection: The matrix B pTO i is equivalent to the matrix RA where R is an £ by d matrix such 
that Rij G {-1/V?, 1/v^} uniformly. Since R is a random projection matrix |24j B pro ^ contains the m 
columns of A randomly projected from dimension n to dimension £. This is easily computed in a streaming 
fashion while requiring at most 0{m£) space and 0(m€) operation per row updated. For proofs of correctness 
and usage see [3] [I] [5] [5] . 



Frequent-Directions: This indicates the modified algorithm described in Section 2.1 with c = 1/3. So, 
while it requires a sketch matrix of size £ X m it night actually return a sketch of rank £/3. Moreover, it only 
guaranties that ||A T A — B T B\\ < 3 ■ tr(A T A) / £. The benefit, however, is that its amortized running time is 
0(m£) per row. 

The generated input matrices A contains d dimensional signal and m dimension noise. More accurately 
A = SDU + N/C The signal coefficients matrix S G R nxd is such that S itj - W(0, 1) i.i.d. The diagonal 
matrix D is Di : i = 1 — (i — l)/d which gives linearly diminishing signal singular values. The signal row space 
matrix U G ]g dxm contains a random d dimensional subspace in M. m , for clarity, UU T = Id- The matrix 
SDU is exactly rank d and constitutes the signal we wish to recover. The matrix N 6 I" xm contributes 
additive Gaussian noise Nij ~ A/"(0, 1). Due to [3S], the spectral norms of SDU and N are expected to be 
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the same up to some constant. Experimentally, this constant is close to 1. Therefore, when the signal to 
noise ratio £ is close to (or less than) 1 we cannot expect to approximate A T A since the noise dominates the 
signal. On the other hand, when £ € oj(1) the spectral norm is dominated by the signal which is therefore 
recoverable. As a remark, note that the Frobenius norm of A is dominated by the noise for £ <= o(y/m/d). 

The values used in the experiments are n = 10, 000, m = 1000, 1 = [10, 20, ... , 300], d = [5, 10, 20, 50, 100], 
C = [1, 2, . . . , 15]. Each method produced a sketch for each matrix, A, which is generated according to 
every parameter combination. Each resulting sketch B was measured for accuracy which is defined as 
||A T A — B T B\\. The running time for producing each sketch by the different methods was also measured. 
The entire experiment was repeated 7 times and the reported results are median values of these independent 
executions. The experiments were conducted on a FreeBSD machine with 50GB RAM, and 12MB cache 
using a single Intel(R) Xeon(R) X5650 CPU. Example results are plotted and explained in Figures [l] [2] and 

m 
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Figure 1: Accuracy vs. sketch size. The y-axis indicates the accuracy of the sketches. If a method returns a 
sketch matrix B the accuracy is measured by \\A T A-B T B\\. The size of the sketch is fixed for all methods and 
is B 6 M^ xm . The value of i is indicated on the re-axis. The form of the input matrix is explained in Section|3] 
Here the signal dimension is d = 50 and the signal to noise ratio is C = 10- There are a few interesting 
observations here. First, all three random techniques actually perform worse than naive for small sketch 
sizes. This is not the case with Frequent-Directions. Second, the three random techniques perform equally 
well. This might be a result of the chosen input. Nevertheless, practitioners should consider these methods 
as comparable alternatives. The sketching technique performs significantly better than all three. The curve 
indicated by "Frequent-Directions Bound" plots the accuracy guarantied by Frequent-Directions. While the 
worst case bound for Frequent-Directions is consistently better than the three competing techniques it is still 
far from being tight for Frequent-Directions for large sketch sizes. Notice that Frequent-Directions reaches 
its best result already at I = 150. This is because the signal dimension is d = 50 and Frequent-Directionsis 
implemented with parameter c = 1/3 (see Section 2.1 1. 



4 Discussion 

This paper draws upon a surprising similarity between two problems, the item frequency approximation 
problem and the matrix sketching problem. It seems that, in general, solutions to the first can be modified 
to solve the second but incur an additional factor of m in both running time and space requirement. This is 
true, for example, about sampling. It is also the case for the memory footprint of Frequent-Items which is 
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Figure 2: Running time in seconds vs. sketch size. Each method produces a sketch matrix B of size £ x 1000 
for a dense 10, 000 x 1, 000 matrix. The value of £ is indicated by the x-axis. The total amount of computation 
time required to produce the sketch is indicated on the j/-axis in seconds. The brute force method computes 
the complete SVD of A and therefore its running time is independent of I. Note that Hashing is almost as 
fast as the Naive method and independent of £ which is expected. The rest of the methods exhibit a linear 
dependence on £ which is also expected. Surprisingly though, sketching is more computationally efficient 
than random projections although both require 0(m£) operations per input row. It is important to stress 
that the implementations above are not very well optimized. Different implementations might lead to slightly 
different results. 
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Figure 3: Accuracy vs. signal to noise ratio. Here, all the sketches are of the same size £ = 100. The 
sketches accuracies are measured as before and indicated on the y-axis. The x-axis gives the value of £ the 
signal to noise ratio. Here the signal dimension is d = 50 and thus the brute force approach, which gives 
the best rank 100 approximation of A, should theoretically recover A almost exactly. Note, however, that 
when £ = 1 the best rank 100 approximation of the matrix is quite poor. That means that spectrum of A 
is dominated by the noise. As £ grows, the noise diminishes and the rows of A become more concentrated 
around the d dimensional signal row space. Note that all the sketching techniques' results improve when the 
noise diminishes, as expected. 
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0(l/e) while for Frequent-Directions it is 0(m/e). But, the update time of Frequent-Items is O(l) and that 
of Frequent-Directions is 0(m/e). It is natural to seek a modified algorithm which exhibits an 0(m) update 
time. Another question is whether more advanced algorithms for fining frequent items in streams could also 
be carried over. A good candidate is the Count Sketch algorithm [2B]. Alas, it depends on item hashing in 
a way which does not naturally translate to the matrix sketching domain. 

Acknowledgments: The author truly thanks Petros Drineas, Jelani Nelson, Nir Ailon, Zohar Karnin, and 
Yoel Shkolnisky for very helpful discussions and pointers. 
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